Overview and Motivation:

Some of our team members are loyal fans of the popular free online card game Hearthstone: Heroes of Warcraft, which Blizzard released worldwide in 2014 and which had more than 40 million registered accounts by November 2015.

The central element of Hearthstone is its cards. Each card has a list of features including cost, attack (the amount of damage it can deal to the opponent per turn), health (the amount of damage it can sustain before being destroyed), and other special abilities. Here is an example of a card:

Before each game starts, the two players each choose 1 hero among the 9 available and then select 30 cards from a pool of more than 700 to build a deck suited to that hero. Each turn, a player draws one card at random from his/her deck and gains one more mana crystal (the game's currency). The player may then play any cards in hand whose cost is within the mana crystals available that turn. The game ends when one player is attacked to death (loses all 30 units of health) or concedes; the other player wins.
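The mana economy described above can be sketched in a few lines of R. Only two rules are encoded, one crystal gained per turn with a cap of 10; card effects are ignored, and the function name is ours:

```r
# Minimal sketch of the mana economy: one crystal gained per turn, capped at 10,
# so a card of cost c is first playable on turn c (ignoring special effects).
mana_by_turn <- pmin(1:12, 10)

first_playable_turn <- function(cost) {
  which(mana_by_turn >= cost)[1]
}

first_playable_turn(4)   # turn 4
first_playable_turn(10)  # turn 10
```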

Therefore, the initial construction of the 30-card deck, as well as the choice of cards to play during the game, directly influences the result. This motivated the following questions:

Initial Questions:

1. What are the “true” values of individual cards? Are there properties Blizzard used to assign the values (costs) of these cards? Are any cards undervalued/overvalued by the company?

2. What is the balance between low-cost cards and high-cost cards?

3. Are there any “core” combinations of cards?

4. Can we build a powerful deck (30 cards) for some heroes?

5. Test the deck we built (optional): we can test our model by simulating games with the deck and strategy we developed and calculating its winning percentage.

Terminology of Hearthstone:

Card: a token that can be played in the game.

Deck: a combination of 30 cards in which a normal card can appear at most twice and a legendary card at most once.
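As a concrete illustration of this rule, a deck can be checked for validity in base R. The card names below are just examples; a real check would read the rarity from the card data:

```r
# Sketch of the deck-building rule: exactly 30 cards, at most 2 copies of a
# normal card, at most 1 copy of a legendary card.
valid_deck <- function(cards, legendary) {
  counts <- table(cards)
  length(cards) == 30 &&
    all(counts <= 2) &&
    all(counts[names(counts) %in% legendary] <= 1)
}

deck <- c(rep("Chillwind Yeti", 2), rep("River Crocolisk", 2), paste("Card", 1:26))
valid_deck(deck, legendary = "Ragnaros the Firelord")  # TRUE: 30 cards, no rule broken
```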

Minion: a type of card that summons a “monster” to fight for you.

Tempo: mostly defined by the costs of the cards in the deck. The more low-cost cards, the faster the pace/tempo.

Early Hand: the player who plays first.

Late Hand: the player who plays second.
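Tempo, as defined above, can be made concrete by computing a deck's mana curve. The costs below are invented for illustration:

```r
# Sketch of a mana curve: the cost distribution of a deck. More mass at low
# costs means a faster tempo.
deck_costs <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 7)   # hypothetical 10-card sample
mana_curve <- table(factor(deck_costs, levels = 0:10))
mean(deck_costs)   # average cost; lower suggests faster tempo
```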

Libraries

Here are the libraries we have used in our project.

library(rjson)
library(dplyr)
library(tidyr)
library(knitr)
library(readr)
library(stringr)
library(ggplot2)
library(gridExtra)
library(graphics)
library(grid)
library(ggrepel)
library(scales)
library(cowplot)
library(rvest)
library(XML)
library(vegan)
library(RColorBrewer)
library(gplots) 
library(devtools)
library(reshape)
library(dendextend)
library(reshape2)
library(VGAM)

Data:

We have two types of data: 1) basic card information (attack/health/cost/description of cards) and 2) frequently used decks from top players.

Basic Card information

  • We collected card data from Hearthstone’s API: http://hearthstoneapi.com. It contains thorough card information: card name, card set (Basic, Classic, Blackrock Mountain, Goblins vs Gnomes, Naxxramas, Promotion, The Grand Tournament, The League of Explorers), card type (minions, spells, weapons), player class (Priest, Hunter, Warlock, Shaman, Warrior, Mage, Paladin, Druid, Rogue, All), card description (text), mechanics of minions (Battlecry, Taunt, Normal, HealTarget, Charge, Aura, AdjacentBuff, Spell Damage, Freeze, Enrage, Divine Shield, Stealth, Overload, Deathrattle, Windfury, Combo, Poisonous, Inspire), and the attack, health, and cost of cards. We collected the JSON file using Python and transferred it into RData (and CSV files) using R.
## Data wrangling from json to RData:
json_file = "cards2.txt"
data <- fromJSON(file = json_file)
card_category = names(data)

not_empty = which(sapply(1:length(data), function(i){length(data[[i]])})>0)

card_category = card_category[not_empty]

data = lapply(not_empty, function(i){data[[i]]})
data1 = lapply(1:length(data), function(k) {lapply(data[[k]],
                                                   function(i) {lapply(i, function(j){
                                                     j = ifelse(is.null(j),NA,j)})})})

col_names = lapply(1:length(data1),
                   function(k) {
                     lapply(1:length(data1[[k]]),
                            function(i) {names(data1[[k]][[i]])})})

data2 = lapply(1:length(data1), 
           function(k) {
             lapply(1:length(data1[[k]]),
                    function(i) {
                      matrix(unlist(data1[[k]][[i]]), 
                             ncol = length(data1[[k]][[i]]), 
                             byrow = T)})})

for(k in 1:length(data2)){
  colnames(data2[[k]][[1]]) = col_names[[k]][[1]]
  data2[[k]][[1]] = data.frame(data2[[k]][[1]])
  for(i in 2:length(data2[[k]])){
    colnames(data2[[k]][[i]]) = col_names[[k]][[i]]
    data2[[k]][[i]] = data.frame(data2[[k]][[i]])
    data2[[k]][[i]] = bind_rows(data2[[k]][[i-1]],data2[[k]][[i]])
  }
  assign(card_category[k], tbl_df(data2[[k]][[length(data2[[k]])]]))
}

final_data = get(card_category[1])
for (i in 2:length(data2)){
  final_data = bind_rows(final_data, get(card_category[i]))
}
# write.table(final_data, file = "final_data.csv", sep = "\t")
# save(final_data, file = "final_data.RData")

Data wrangling of card descriptions: this part aims at a more detailed classification of minion card descriptions (beyond the mechanics they are currently classified under).

  1. Some minions have more than one mechanic (e.g., a Taunt minion that also has Deathrattle) but are classified under only one of them in the Hearthstone dataset. The following wrangling classifies them with all the mechanics they have, using dummy variables (has the feature = 1 vs. lacks the feature = 0).
load("minions_text.RData")
minions_text = tbl_df(minions_text) %>%
  select(-cardId, -flavor, -type, -artist, -collectible, -howToGet, -howToGetGold, -img, -imgGold, -locale, -race, -faction, -elite) %>%
  mutate(playerClass = ifelse(is.na(playerClass), "All", as.character(playerClass)))

minions_text = minions_text %>% 
  mutate(text = as.character(text)) %>%
  mutate(text = gsub("<b>", "", text)) %>%
  mutate(text = gsub("</b>", "", text)) %>%
  mutate(text = gsub("\xa1\xaf", "'", text)) %>%
  mutate(text = ifelse(is.na(text), "None", text)) 

minions_text = minions_text %>%
  mutate(AdjacentBuff = ifelse(text %in% minions_text$text[grep("AdjacentBuff",minions_text$text)], 1, AdjacentBuff))%>%
  mutate(Aura = ifelse(text %in% minions_text$text[grep("Aura",minions_text$text)], 1, 0))%>%
  mutate(Battlecry = ifelse(text %in% minions_text$text[grep("Battlecry",minions_text$text)], 1, Battlecry))%>%
  mutate(Charge = ifelse(text %in% minions_text$text[grep("Charge",minions_text$text)], 1, Charge))%>%
  mutate(Combo = ifelse(text %in% minions_text$text[grep("Combo",minions_text$text)], 1, Combo))%>%
  mutate(Deathrattle = ifelse(text %in% minions_text$text[grep("Deathrattle",minions_text$text)], 1, Deathrattle))%>%
  # card text reads "Divine Shield", not "Divine_Shield"
  mutate(Divine_Shield = ifelse(text %in% minions_text$text[grep("Divine Shield",minions_text$text)], 1, Divine_Shield))%>%
  mutate(Enrage = ifelse(text %in% minions_text$text[grep("Enrage",minions_text$text)], 1, Enrage))%>%
  mutate(Inspire = ifelse(text %in% minions_text$text[grep("Inspire",minions_text$text)], 1, Inspire))%>%
  mutate(Overload = ifelse(text %in% minions_text$text[grep("Overload",minions_text$text)], 1, Overload))%>%
  mutate(Poisonous = ifelse(text %in% minions_text$text[grep("Poisonous",minions_text$text)], 1, Poisonous))%>%
  mutate(Windfury = ifelse(text %in% minions_text$text[grep("Windfury",minions_text$text)], 1, Windfury))
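The twelve mutate() calls above all follow one pattern, so the same flags can be built in a short loop. This sketch uses a toy data frame in place of minions_text and grepl() in place of the `text %in% ... grep()` idiom (the two are equivalent here):

```r
# Flag each mechanic with a 0/1 dummy by matching its name in the card text.
minions <- data.frame(
  text = c("Battlecry: Deal 2 damage.", "Taunt", "Charge", "None"),
  stringsAsFactors = FALSE
)

mechanics <- c("Battlecry", "Taunt", "Charge", "Deathrattle")
for (m in mechanics) {
  minions[[m]] <- as.integer(grepl(m, minions$text, fixed = TRUE))
}

minions$Battlecry    # 1 0 0 0
minions$Deathrattle  # 0 0 0 0
```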
  2. The beauty of Hearthstone (and the most difficult part for quantitative analysis) is that almost every minion has a unique feature described in text on the card. It is therefore hard to “value” a card without taking these descriptions into account, while on the other hand the texts themselves are difficult to quantify. The following wrangling identifies certain verbs (deal, restore, etc.) and nouns (attack, health, etc.) frequently used in card descriptions and tries to classify cards with additional features not captured by their mechanics.
minions_text = minions_text %>%
  mutate(Choice = ifelse(text %in% minions_text$text[grep("; or",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Conditional = ifelse(text %in% minions_text$text[grep("if",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Conditional = ifelse(text %in% minions_text$text[grep("whenever",minions_text$text, ignore.case = T)], 1, Conditional)) %>% 
  mutate(Conditional = ifelse(text %in% minions_text$text[grep(",",minions_text$text, ignore.case = T)], 1, Conditional)) %>% 
  mutate(Add = ifelse(text %in% minions_text$text[grep("add",minions_text$text, ignore.case = T)], 1, 0)) %>%
  mutate(Cast = ifelse(text %in% minions_text$text[grep("cast",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Deal = ifelse(text %in% minions_text$text[grep("Deal",minions_text$text, ignore.case = T)], 1, 0)) %>%
  mutate(Destroy = ifelse(text %in% minions_text$text[grep("destroy",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Discover = ifelse(text %in% minions_text$text[grep("discover",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Draw = ifelse(text %in% minions_text$text[grep("draw",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Discard = ifelse(text %in% minions_text$text[grep("discard",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Freeze = ifelse(text %in% minions_text$text[grep("freeze",minions_text$text, ignore.case = T)], 1, Freeze)) %>% 
  mutate(Gain = ifelse(text %in% minions_text$text[grep("gain",minions_text$text, ignore.case = T)], 1, 0)) %>%
  mutate(Give = ifelse(text %in% minions_text$text[grep("give",minions_text$text, ignore.case = T)],1,0)) %>%
  mutate(Reduce = ifelse(text %in% minions_text$text[grep("reduce",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Remove = ifelse(text %in% minions_text$text[grep("remove",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Restore = ifelse(text %in% minions_text$text[grep("restore",minions_text$text, ignore.case = T)], 1, 0))%>%
  mutate(Reveal = ifelse(text %in% minions_text$text[grep("reveal",minions_text$text, ignore.case = T)],1,0)) %>%
  mutate(Silence = ifelse(text %in% minions_text$text[grep("silence",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Summon = ifelse(text %in% minions_text$text[grep("summon",minions_text$text, ignore.case = T)], 1, 0)) %>% 
  mutate(Trigger = ifelse(text %in% minions_text$text[grep("trigger",minions_text$text, ignore.case = T)],1,0)) %>%
  mutate(Number_within = ifelse(text %in% minions_text$text[grep("\\+[0-9]", minions_text$text)],1,0))%>%
  mutate(Attack = ifelse(text %in% minions_text$text[grep("attack",minions_text$text, ignore.case = T)], 1, 0))%>%
  mutate(Health = ifelse(text %in% minions_text$text[grep("health",minions_text$text, ignore.case = T)], 1, 0))%>%
  mutate(Damage = ifelse(text %in% minions_text$text[grep("damage",minions_text$text, ignore.case = T)], 1, 0)) %>%
  mutate(Cant = ifelse(text %in% minions_text$text[grep("can't",minions_text$text, ignore.case = T)], 1, 0)) %>%
  mutate(Nothing = ifelse(text == "None", 1, 0))

colnames(minions_text)

save(minions_text, file = "minions_text.RData")

Exploratory Analysis

theme_set(theme_bw(base_size = 16))
load("minions_text.RData")
data<-minions_text

distribution of cost: First, we want to see how the costs of cards are distributed across the card database. A large portion of cards cost 2 to 4, and the distribution is right-skewed. The plots for attack and health show similar patterns to the cost distribution, which hints at using a linear model to predict cost from attack and health.

#remove costs that are "12" and "20" for these two cards are very special
Cost<-data%>%dplyr::arrange(cost)
Cost<-unique(data%>%filter(cost<=10)%>%group_by(cost)%>%dplyr::mutate(n=n())%>%ungroup()%>%select(cost,n))
# 7 stands for higher than 7
Cost1<-Cost%>%filter(cost<7)                       
Cost2<-c(7,61)
Cost<-rbind(Cost1,Cost2)

Cost<-Cost%>%dplyr::mutate(pos=cumsum(n)-n/2)
p<-Cost%>%ggplot(aes(x=1,y=n,fill=factor(cost)))
p+geom_bar(stat="identity",width=1)+geom_text(aes(x=1.6,y=pos,label = percent(n/sum(n))))+coord_polar(theta="y")+ xlab('')+ylab('')+theme(axis.text=element_blank(),axis.ticks=element_blank(),panel.grid=element_blank())+ggtitle("Pie Chart of Card Cost")

#histogram
ggplot(data, aes(factor(cost)))+ geom_bar()+scale_fill_brewer()+ggtitle("Histogram for Cost")+xlab("Cost")

Conclusion: cards with cost “2”, “3”, and “4” (out of the 11 possible cost values) account for the largest share of the card pool and are the most common in decks.

distribution of attack

Attack<-data%>%arrange(attack)
Attack<-unique(Attack%>%group_by(attack)%>%dplyr::mutate(n=n())%>%ungroup()%>%select(attack,n))
#histogram
ggplot(data, aes(factor(attack)))+ geom_bar()+scale_fill_brewer()+ggtitle("Histogram for Attack")+xlab("Attack")

distribution of health

Health<-data%>%arrange(health)
Health<-unique(Health%>%group_by(health)%>%mutate(n=n())%>%ungroup()%>%select(health,n))
#histogram
ggplot(data, aes(factor(health)))+ geom_bar()+scale_fill_brewer()+ggtitle("Histogram for health")+xlab("Health")+ theme(axis.text.x = element_text(angle = 0, hjust = 1))

distribution of mechanics: In the whole card dataset, there are potential confounders such as mechanics, rarity, and card set that might affect our assumed linear relationship between cost, attack, and health. Taking mechanics as an example, we plotted the cost, attack, and health distributions under the different mechanics and found similar patterns across all of them. The same held for the other two confounders: all the distributions showed similar patterns.

Mechanics<-data%>%arrange(mechanics)
Mechanics<-unique(Mechanics%>%group_by(mechanics)%>%dplyr::mutate(n=n())%>%ungroup()%>%select(mechanics,n))
#histogram
ggplot(data, aes(factor(mechanics)))+ geom_bar()+scale_fill_brewer()+ggtitle("Histogram for Mechanics")+xlab("Mechanics")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))

*distribution of each mechanic

qplot(factor(data$mechanics),data$cost,geom="boxplot")+scale_fill_brewer()+theme(axis.text.x = element_text(angle = 45, hjust = 1,size=10))+ggtitle("Boxplots for cost in mechanics")+ylab("cost")+xlab("Mechanics")

qplot(factor(data$mechanics),data$health,geom="boxplot")+scale_fill_brewer()+theme(axis.text.x = element_text(angle = 45, hjust = 1,size=10))+ggtitle("Boxplots for health in mechanics")+ylab("health")+xlab("Mechanics")

qplot(factor(data$mechanics),data$attack,geom="boxplot")+scale_fill_brewer()+theme(axis.text.x = element_text(angle = 45, hjust = 1,size=10))+ggtitle("Boxplots for Attack in mechanics")+ylab("attack")+xlab("Mechanics")

distribution of cardSet

cs<-data%>%arrange(cardSet)
cs<-unique(cs%>%group_by(cardSet)%>%dplyr::mutate(n=n())%>%ungroup()%>%select(cardSet,n))
#pie chart
cs<-cs%>%dplyr::mutate(pos=cumsum(n)-n/2)
p<-cs%>%ggplot(aes(x=1,y=n,fill=factor(cardSet)))
p+geom_bar(stat="identity",width=1)+geom_text(aes(x=1.6,y=pos,label = percent(n/sum(n))))+coord_polar(theta="y")+ xlab('')+ylab('')+theme(axis.text=element_blank(),axis.ticks=element_blank(),panel.grid=element_blank())+ggtitle("Pie Chart of CardSet")

#histogram
qplot(data$cardSet,xlab="CardSet",main="Histogram for CardSet distribution")+theme(axis.text.x = element_text(angle = 45, hjust = 1))

*distribution of each CardSet

qplot(factor(data$cardSet),data$cost,geom="boxplot")+scale_fill_brewer()+theme(axis.text.x = element_text(angle = 45, hjust = 1,size=10))+ggtitle("Boxplots for cost in CardSet")+ylab("cost")+xlab("CardSet")

qplot(factor(data$cardSet),data$health,geom="boxplot")+scale_fill_brewer()+theme(axis.text.x = element_text(angle = 45, hjust = 1,size=10))+ggtitle("Boxplots for health in CardSet")+ylab("health")+xlab("CardSet")

qplot(factor(data$cardSet),data$attack,geom="boxplot")+scale_fill_brewer()+theme(axis.text.x = element_text(angle = 45, hjust = 1,size=10))+ggtitle("Boxplots for Attack in CardSet")+ylab("attack")+xlab("CardSet")

distribution of rarity

rr<-unique(data%>%group_by(rarity)%>%dplyr::mutate(n=n())%>%ungroup()%>%select(rarity,n))
#pie chart
rr<-rr%>%mutate(pos=cumsum(n)-n/2)
p<-rr%>%ggplot(aes(x=1,y=n,fill=factor(rarity)))
p+geom_bar(stat="identity",width=1)+geom_text(aes(x=1.6,y=pos,label = percent(n/sum(n))))+coord_polar(theta="y")+ xlab('')+ylab('')+theme(axis.text=element_blank(),axis.ticks=element_blank(),panel.grid=element_blank())+ggtitle("Pie Chart of rarity")

#histogram
qplot(data$rarity,xlab="Rarity",main="Histogram for rarity")+theme(axis.text.x = element_text(angle = 45, hjust = 1))

*distribution for each rarity

qplot(factor(data$rarity),data$cost,geom="boxplot")+scale_fill_brewer()+theme(axis.text.x = element_text(angle = 45, hjust = 1,size=10))+ggtitle("Boxplots for cost in Rarity")+xlab("Rarity")+ylab("cost")

qplot(factor(data$rarity),data$health,geom="boxplot")+scale_fill_brewer()+theme(axis.text.x = element_text(angle = 45, hjust = 1,size=10))+ggtitle("Boxplots for health in Rarity")+xlab("Rarity")+ylab("health")

qplot(factor(data$rarity),data$attack,geom="boxplot")+scale_fill_brewer()+theme(axis.text.x = element_text(angle = 45, hjust = 1,size=10))+ggtitle("Boxplots for Attack in Rarity")+xlab("Rarity")+ylab("attack")

Frequently used decks from top players

Instead of looking at the card information alone, we try to consider how one card interacts with others. We use decks built by top Hearthstone players, collected from the following website: http://www.hearthstonetopdecks.com/ A typical deck looks like this:

Data Wrangling

classes<-c("druid/","hunter/","mage/","paladin/","priest/","rogue/","shaman/","warlock/","warrior/")
removeList<-c(9,6,10,10,4,7,10,7,7)
baseURL<-"http://www.hearthstonetopdecks.com/deck-category/class/"


totalInfoDeckList<-list()

heroDeckLists<-list()

for(k in 1:length(classes)){
  class<-classes[k]
  classBaseURL<-paste(baseURL,class,"page/",sep="")
  allDecks<-list()
  for (j in 1:5){
    tableURL<-paste(classBaseURL,j,sep="")
    tables<-as.data.frame(readHTMLTable(tableURL))
    deckNames<-lapply(tables[,2],as.character)
    deckNames<-unlist(deckNames)
    
    for(i in 1:length(deckNames)){
      urlName<-tolower(gsub("\\s","-",gsub("[^\\w \\s]+","",deckNames[i],perl = TRUE),perl = TRUE))
      
      testURL<-paste("http://www.hearthstonetopdecks.com/decks/",urlName,sep="")
      # skip decks whose page fails to load instead of silently reusing the last page
      webpage<-tryCatch(read_html(testURL),error=function(e){NULL})
      if(is.null(webpage)) next
      
      cardNames<-webpage%>%
        html_nodes(".card-name")%>%
        html_text()
      
      cardCounts<-webpage%>%
        html_nodes(".card-count")%>%
        html_text()%>%
        as.numeric()
      
      deckId<-(j-1)*25+i
      
      deck<-cbind(cardNames,cardCounts,rep(deckId,length(cardNames)))
      
      allDecks[[deckId]]<-deck
    }
  }
  largerTable<-data.frame()
  
  for (i in removeList[k]:125){
    largerTable<-rbind(largerTable,allDecks[[i]])
  }
  
  largerTable<-largerTable%>%spread(key=V3,value=cardCounts)
  
  for (i in 2:length(largerTable)){
    largerTable[,i]<-as.numeric(as.character(largerTable[,i]))
  }
  
  largerTable[is.na(largerTable)]<-0
  
  heroDeckLists[[k]]<-largerTable
  
}

for(i in 1:9){
  totalInfoDeckList[[i]]<-heroDeckLists[[i]]%>%select(c(1,length(heroDeckLists[[i]])))
}

for(i in 1:9){
  totalInfoDeckList[[i]]<-totalInfoDeckList[[i]]%>%left_join(cards,by=c("cardNames"="name"))
}
decks<-list()
for(i in 1:9){
  decks[[i]]<-heroDeckLists[[i]]%>%gather(deckId,cardCounts,2:(length(heroDeckLists[[i]])-1))
}

Final Analysis:

Card Value Analysis

1. What are the “true” values of individual cards? Are there properties Blizzard used to assign the values (costs) of these cards? Are any cards undervalued/overvalued by the company?

load("minions_text.RData")
## cost vs attack+health:
minions_text %>% ggplot(aes(cost)) + stat_bin(aes(y = ..count..), bins = 50 , position='dodge')

minions_text %>% mutate(attplusheal = attack+health) %>% ggplot(aes(attplusheal)) + stat_bin(aes(y = ..count..), bins = 50 , position='dodge')

From the above plots, we can see similar distributions for cost and for the sum of attack and health; both are right-skewed. There also appear to be some outliers that are very different from the other cards.

minions_text %>%
  filter(cost > 10) %>%
  select(name, cost, attack, health, mechanics, playerClass)
## Source: local data frame [3 x 6]
## 
##              name  cost attack health mechanics playerClass
##            (fctr) (int)  (int)  (int)     (chr)       (chr)
## 1  Mountain Giant    12      8      8    Normal         All
## 2    Molten Giant    20      8      8    Normal         All
## 3 Clockwork Giant    12      8      8    Normal         All

It might be a good idea to filter out these cards.

highcost_card = minions_text %>% 
  mutate(attplusheal = attack+health) %>% filter(cost > 10) %>% 
  mutate(cost1 = 7) %>%
  select(cardId, name, cost,cost1,mechanics,Charge, Overload, attack, health)
  

minions_text = minions_text %>% mutate(attplusheal = attack+health) %>% filter(cost <= 10)

## cost vs attack+health:
minions_text %>% mutate(attplusheal = attack+health) %>% 
  group_by(attplusheal) %>%
  summarize(cost = mean(cost)) %>%
  ggplot(aes(attplusheal, cost)) + geom_point()

We can see from the above graph that higher attplusheal value (attack+health) is associated with higher mean cost.

In Hearthstone, the cost of cards is usually categorized into 0–6 and 7+. Here, we wrangled the card costs into these 8 categories and also separated them by cardSet:

## All:
minions_text = minions_text %>% 
  mutate(cost1 = ifelse(cost >= 7, 7, cost))
minions_text %>% ggplot(aes(cost1)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

minions_text %>% mutate(attplusheal = attack+health) %>% 
  group_by(cost1, attplusheal) %>% summarize(count = n()) %>%
  ggplot(aes(attplusheal, cost1, col = factor(floor(count/10)*10))) + geom_point()

## by cardSet:
minions_text %>% ggplot(aes(cost1, group = cardSet, fill = cardSet)) + stat_bin(aes(y = ..count..), bins = 20 , position='dodge')

## by attack, cost, health:
minions_bars = minions_text %>% gather(key, value, cost, attack, health)
minions_bars %>% ggplot(aes(value, group = key, fill = key)) + stat_bin(aes(y = ..count..), bins = 40, position='dodge')

## by cardSet:
## Cost:
minions_text %>% ggplot(aes(cost, group = cardSet, fill = cardSet)) + stat_bin(aes(y = ..count..), bins = 40 , position='dodge')

## Attack:
minions_text %>% ggplot(aes(attack, group = cardSet, fill = cardSet)) + stat_bin(aes(y = ..count..), bins = 40 , position='dodge')

## Health:
minions_text %>% ggplot(aes(health, group = cardSet, fill = cardSet)) + stat_bin(aes(y = ..count..), bins = 40 , position='dodge')

Since the outcome variable (Y) in our analysis is the cost of cards, which is normally an integer from 0 to 7+ (all values greater than 7 are grouped as 7+), we adopted a model for ordinal polytomous outcomes – the cumulative logit model. Since the effects of the features (attack, health, special abilities, etc.) should be similar across cards of different costs, we also assumed proportional odds of these features across cost groups. We ended up with 7 outcome groups (cost values 1 to 7+); we excluded cards that cost 0 mana because 1) they usually cost players nothing to play and 2) their nature is quite different from normal minion cards. In general, the cumulative logit model has the form shown below, where X is the covariate matrix and \(\beta\) is the coefficient vector:

\[\mbox{logit}(\mbox{Pr}(Y \leq k \mid X_i = x_i)) = \beta_{k0} + \sum_i \beta_{i} x_i\]

Using this cumulative logit model, we can estimate the probability of a card falling in each cost group (p1 to p7) and then, conditioning on the card's features, assign it the value with the maximum probability among p1 to p7 (the most likely cost of the card given its features).
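Written out, the step from cumulative probabilities to the per-group probabilities p1 to p7 used below is

\[\Pr(Y = k \mid x_i) = \Pr(Y \leq k \mid x_i) - \Pr(Y \leq k-1 \mid x_i),\]

with the conventions \(\Pr(Y \leq 0 \mid x_i) = 0\) and \(\Pr(Y \leq 7 \mid x_i) = 1\), where each cumulative probability comes from inverting the logit: \(\Pr(Y \leq k \mid x_i) = e^{\eta_k}/(1 + e^{\eta_k})\) with \(\eta_k = \beta_{k0} + \sum_i \beta_i x_i\).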

Since one of our assumptions is that the cost of a card is proportional to the damage it can deal, we first considered a univariate model with attack as the only covariate:

## X: attack
## Y: cost

minions_text1 = minions_text %>% 
  mutate(mechanics1 = ifelse(mechanics %in% c("Charge","Divine Shield", "Overload", "Taunt", "Stealth", "Windfury"), mechanics, "A")) %>%
  filter(cost != 0) %>%
  arrange(cost) %>%
  mutate(Y1 = ifelse(cost == 1, 1, 0)) %>%
  mutate(Y2 = ifelse(cost == 2, 1, 0)) %>%
  mutate(Y3 = ifelse(cost == 3, 1, 0)) %>%
  mutate(Y4 = ifelse(cost == 4, 1, 0)) %>%
  mutate(Y5 = ifelse(cost == 5, 1, 0)) %>%
  mutate(Y6 = ifelse(cost == 6, 1, 0)) %>%
  mutate(Y7 = ifelse(cost >= 7, 1, 0)) 

set.seed(1001)
n_test <- round(nrow(minions_text1) / 10)
test_indices <- sample(1:nrow(minions_text1), n_test, replace=FALSE)
test <- minions_text1[test_indices,]
train <- minions_text1[-test_indices,]

fitCL = vglm(cbind(Y1, Y2, Y3, Y4, Y5, Y6, Y7) ~ attack, cumulative(parallel = T, reverse = F), data = train)
# summary(fitCL)

for(i in 1: 6){
  assign(paste("coef",i, sep = ""), as.data.frame((coef(fitCL, matrix = T)[,i])))
}

# To estimate the cost of cards based on attack:
test1 = test %>% mutate(p1 = as.numeric(exp(coef1[1,]+coef1[2,]*attack)/(1+exp(coef1[1,]+coef1[2,]*attack)))) %>%
  mutate(p2 = as.numeric(exp(coef2[1,]+coef2[2,]*attack)/(1+exp(coef2[1,]+coef2[2,]*attack))) - p1) %>%
  mutate(p3 = as.numeric(exp(coef3[1,]+coef3[2,]*attack)/(1+exp(coef3[1,]+coef3[2,]*attack))) - p1 - p2) %>%
  mutate(p4 = as.numeric(exp(coef4[1,]+coef4[2,]*attack)/(1+exp(coef4[1,]+coef4[2,]*attack))) - p1 - p2 - p3) %>%
  mutate(p5 = as.numeric(exp(coef5[1,]+coef5[2,]*attack)/(1+exp(coef5[1,]+coef5[2,]*attack))) - p1 - p2 - p3 - p4) %>%
  mutate(p6 = as.numeric(exp(coef6[1,]+coef6[2,]*attack)/(1+exp(coef6[1,]+coef6[2,]*attack))) - p1 - p2 - p3 - p4 - p5) %>% mutate(p7 = 1 - p1 - p2 - p3 - p4 - p5 - p6) %>% 
  mutate(value = 7) %>% 
  group_by(cardId) %>%
  summarize(value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p1, 1, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p2, 2, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p3, 3, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p4, 4, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p5, 5, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p6, 6, value))

test1 = test %>% left_join(test1, by = "cardId")
RMSE <- function(true_ratings, predicted_ratings){
    sqrt(mean((true_ratings - predicted_ratings)^2))
}
model1 = RMSE(test1$cost1, test1$value)
rmse_results = data_frame(method = "X: attack", RMSE = model1)
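The chain of mutate(p1) to mutate(p7) and ifelse() calls above hand-codes one step: pick the cost group with the largest probability. Given a matrix of group probabilities (rows = cards, columns = cost groups 1 to 7), base R's max.col() does this in one call; and if fitCL is the fitted object, VGAM's predict(fitCL, newdata = test, type = "response") should return such a matrix directly. The matrix below is a made-up example:

```r
# Each row holds the estimated probabilities p1..p7 for one card; the predicted
# cost is the column index of the row maximum.
probs <- rbind(c(0.60, 0.20, 0.10, 0.05, 0.03, 0.01, 0.01),
               c(0.05, 0.10, 0.15, 0.40, 0.20, 0.05, 0.05))
predicted_cost <- max.col(probs, ties.method = "first")
predicted_cost  # 1 4
```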

Since the cost of a card can also be influenced by how long it can survive on the board, we also included a potential effect of health by summing attack and health (attack+health) as a single covariate:

## X: attplusheal
## Y: cost

fitCL = vglm(cbind(Y1, Y2, Y3, Y4, Y5, Y6, Y7) ~ attplusheal, cumulative(parallel = T, reverse = F), data = train)
# summary(fitCL)

for(i in 1: 6){
  assign(paste("coef",i, sep = ""), as.data.frame((coef(fitCL, matrix = T)[,i])))
}

# To estimate the cost of cards based on attack plus health:
test2 = test %>% mutate(p1 = as.numeric(exp(coef1[1,]+coef1[2,]*attplusheal)/(1+exp(coef1[1,]+coef1[2,]*attplusheal)))) %>%
  mutate(p2 = as.numeric(exp(coef2[1,]+coef2[2,]*attplusheal)/(1+exp(coef2[1,]+coef2[2,]*attplusheal))) - p1) %>%
  mutate(p3 = as.numeric(exp(coef3[1,]+coef3[2,]*attplusheal)/(1+exp(coef3[1,]+coef3[2,]*attplusheal))) - p1 - p2) %>%
  mutate(p4 = as.numeric(exp(coef4[1,]+coef4[2,]*attplusheal)/(1+exp(coef4[1,]+coef4[2,]*attplusheal))) - p1 - p2 - p3) %>%
  mutate(p5 = as.numeric(exp(coef5[1,]+coef5[2,]*attplusheal)/(1+exp(coef5[1,]+coef5[2,]*attplusheal))) - p1 - p2 - p3 - p4) %>%
  mutate(p6 = as.numeric(exp(coef6[1,]+coef6[2,]*attplusheal)/(1+exp(coef6[1,]+coef6[2,]*attplusheal))) - p1 - p2 - p3 - p4 - p5) %>% 
  mutate(p7 = 1 - p1 - p2 - p3 - p4 - p5 - p6) %>% 
  mutate(value = 7) %>% 
  group_by(cardId) %>%
  summarize(value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p1, 1, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p2, 2, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p3, 3, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p4, 4, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p5, 5, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p6, 6, value))

test2 = test %>% left_join(test2, by = "cardId")
model2 = RMSE(test2$cost1, test2$value)
rmse_results =  bind_rows(rmse_results, data_frame(method = "X: attplusheal", RMSE = model2))

The combined attack+health covariate seemed to work well: when we tested the model on the test set, the RMSE decreased. We also considered a model that includes attack and health as separate covariates:

## X: attack, health
## Y: cost

fitCL = vglm(cbind(Y1, Y2, Y3, Y4, Y5, Y6, Y7) ~ health + attack, cumulative(parallel = T, reverse = F), data = train)
# summary(fitCL)

for(i in 1: 6){
  assign(paste("coef",i, sep = ""), as.data.frame((coef(fitCL, matrix = T)[,i])))
}


# To estimate the cost of cards based on attack and health:
test3 = test %>% mutate(p1 = as.numeric(exp(coef1[1,]+coef1[2,]*health+coef1[3,]*attack)/(1+exp(coef1[1,]+coef1[2,]*health+coef1[3,]*attack)))) %>%
  mutate(p2 = as.numeric(exp(coef2[1,]+coef2[2,]*health+coef2[3,]*attack)/(1+exp(coef2[1,]+coef2[2,]*health+coef2[3,]*attack))) - p1) %>%
  mutate(p3 = as.numeric(exp(coef3[1,]+coef3[2,]*health+coef3[3,]*attack)/(1+exp(coef3[1,]+coef3[2,]*health+coef3[3,]*attack))) - p1 - p2) %>%
  mutate(p4 = as.numeric(exp(coef4[1,]+coef4[2,]*health+coef4[3,]*attack)/(1+exp(coef4[1,]+coef4[2,]*health+coef4[3,]*attack))) - p1 - p2 - p3) %>%
  mutate(p5 = as.numeric(exp(coef5[1,]+coef5[2,]*health+coef5[3,]*attack)/(1+exp(coef5[1,]+coef5[2,]*health+coef5[3,]*attack))) - p1 - p2 - p3 - p4) %>%
  mutate(p6 = as.numeric(exp(coef6[1,]+coef6[2,]*health+coef6[3,]*attack)/(1+exp(coef6[1,]+coef6[2,]*health+coef6[3,]*attack))) - p1 - p2 - p3 - p4 - p5) %>% mutate(p7 = 1 - p1 - p2 - p3 - p4 - p5 - p6) %>% 
  mutate(value = 7) %>% 
  group_by(cardId) %>%
  summarize(value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p1, 1, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p2, 2, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p3, 3, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p4, 4, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p5, 5, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p6, 6, value))

test3 = test %>% left_join(test3, by = "cardId")
model3 = RMSE(test3$cost1, test3$value)
rmse_results = bind_rows(rmse_results,data_frame(method="X: attack, health",  
                                     RMSE = model3))
rmse_results
## Source: local data frame [3 x 2]
## 
##              method      RMSE
##               (chr)     (dbl)
## 1         X: attack 1.0973065
## 2    X: attplusheal 0.8451543
## 3 X: attack, health 0.8451543

This model seems even better, since it allows health and attack to have different effects on cost. We therefore chose this model and tried to adjust for additional effects that emerged from card features. Since Hearthstone card descriptions are sometimes not quantifiable, we singled out the features that are easily quantifiable into categories. We ended up categorizing cards into Charge (can attack immediately once played), Divine Shield (a protective shield that absorbs the damage from the first hit), Overload (Shaman-specific cards that can deal dramatic damage very early, but playing them limits the mana available in the next round), Taunt (protects the hero; the opponent must attack Taunt minions before attacking the hero), Stealth (invisible and cannot be attacked until their first attack), and Windfury (can attack twice each turn). Cards that could not be classified into these categories were treated as normal cards and set as the reference group in the model.

## X: attack, health, mechanics(factors)
## Y: cost

fitCL = vglm(cbind(Y1, Y2, Y3, Y4, Y5, Y6, Y7) ~ health + attack + factor(mechanics1), cumulative(parallel = T, reverse = F), data = train)
summary(fitCL)
## 
## Call:
## vglm(formula = cbind(Y1, Y2, Y3, Y4, Y5, Y6, Y7) ~ health + attack + 
##     factor(mechanics1), family = cumulative(parallel = T, reverse = F), 
##     data = train)
## 
## Pearson residuals:
##                    Min        1Q    Median        3Q    Max
## logit(P[Y<=1])  -1.711 -0.150088 -0.046451 -0.009544  3.330
## logit(P[Y<=2])  -2.880 -0.217016 -0.047421  0.151582  9.053
## logit(P[Y<=3]) -12.558 -0.205327 -0.008254  0.243679  9.857
## logit(P[Y<=4])  -6.926 -0.098060  0.056029  0.179723 18.236
## logit(P[Y<=5])  -3.450  0.008930  0.041767  0.152892 21.704
## logit(P[Y<=6]) -28.417  0.008527  0.025442  0.084767 19.170
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1                    2.80272    0.31293   8.956  < 2e-16 ***
## (Intercept):2                    5.06852    0.34258  14.795  < 2e-16 ***
## (Intercept):3                    7.01085    0.40926  17.130  < 2e-16 ***
## (Intercept):4                    8.93188    0.49063  18.205  < 2e-16 ***
## (Intercept):5                   10.70372    0.57072  18.755  < 2e-16 ***
## (Intercept):6                   12.97561    0.69643  18.631  < 2e-16 ***
## health                          -1.03594    0.07504 -13.804  < 2e-16 ***
## attack                          -1.05071    0.07585 -13.852  < 2e-16 ***
## factor(mechanics1)Charge        -2.07519    0.54697  -3.794 0.000148 ***
## factor(mechanics1)Divine Shield -1.18730    0.77100  -1.540 0.123574    
## factor(mechanics1)Overload       2.01039    1.11281   1.807 0.070827 .  
## factor(mechanics1)Stealth        0.19812    0.66840   0.296 0.766918    
## factor(mechanics1)Taunt          0.36004    0.38950   0.924 0.355292    
## factor(mechanics1)Windfury       0.41925    0.86627   0.484 0.628411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  6 
## 
## Dispersion Parameter for cumulative family:   1
## 
## Residual deviance: 1039.403 on 2620 degrees of freedom
## 
## Log-likelihood: -519.7015 on 2620 degrees of freedom
## 
## Number of iterations: 7 
## 
## Exponentiated coefficients:
##                          health                          attack 
##                       0.3548941                       0.3496877 
##        factor(mechanics1)Charge factor(mechanics1)Divine Shield 
##                       0.1255327                       0.3050445 
##      factor(mechanics1)Overload       factor(mechanics1)Stealth 
##                       7.4662076                       1.2191085 
##         factor(mechanics1)Taunt      factor(mechanics1)Windfury 
##                       1.4333910                       1.5208129

From the above output, we can see that after adjusting for health and attack, Charge and Overload are the two features most likely to influence the overall valuation. We then considered two further models: one adding Charge, and one adding both Charge and Overload.
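The prediction rule used in the code below can be sketched independently of R: in a proportional-odds cumulative logit model, P(cost ≤ k) is a logistic function of the linear predictor, the per-class probabilities are successive differences, and the predicted cost is the most probable class. A minimal Python sketch with made-up coefficients (signs and rough magnitudes chosen to resemble the fit above, not the exact estimates):

```python
import math

def predict_cost(attack, health, intercepts, b_health, b_attack):
    """Predicted class (1..K) from a proportional-odds cumulative logit model.
    `intercepts` holds the K-1 intercepts for logit P(Y <= k)."""
    cum = [1 / (1 + math.exp(-(a + b_health * health + b_attack * attack)))
           for a in intercepts]          # P(Y <= k), k = 1..K-1
    cum.append(1.0)                      # P(Y <= K) = 1
    # class probabilities are successive differences of the cumulative ones
    probs = [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]
    return probs.index(max(probs)) + 1   # argmax, 1-based

# hypothetical coefficients, not the fitted values
intercepts = [2.8, 5.1, 7.0, 8.9, 10.7, 13.0]
cost = predict_cost(attack=2, health=2, intercepts=intercepts,
                    b_health=-1.04, b_attack=-1.05)   # a 2/2 minion
```

With negative slopes, larger stats push the cumulative probabilities down and shift mass toward higher cost classes; this is exactly how the p1..p7 columns in the R code are differenced and arg-maxed.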

## X: attack, health, charge
## Y: cost

fitCL = vglm(cbind(Y1, Y2, Y3, Y4, Y5, Y6, Y7) ~ health + attack + Charge, cumulative(parallel = T, reverse = F), data = train)
# summary(fitCL)


for(i in 1: 6){
  assign(paste("coef",i, sep = ""), as.data.frame((coef(fitCL, matrix = T)[,i])))
}


# To estimate the cost of cards based on attack, health, and charge:
test3 = test %>% mutate(p1 = as.numeric(exp(coef1[1,]+coef1[2,]*health+coef1[3,]*attack+coef1[4,]*Charge)/(1+exp(coef1[1,]+coef1[2,]*health+coef1[3,]*attack+coef1[4,]*Charge)))) %>%
  mutate(p2 = as.numeric(exp(coef2[1,]+coef2[2,]*health+coef2[3,]*attack+coef1[4,]*Charge)/(1+exp(coef2[1,]+coef2[2,]*health+coef2[3,]*attack+coef1[4,]*Charge))) - p1) %>%
  mutate(p3 = as.numeric(exp(coef3[1,]+coef3[2,]*health+coef3[3,]*attack+coef1[4,]*Charge)/(1+exp(coef3[1,]+coef3[2,]*health+coef3[3,]*attack+coef1[4,]*Charge))) - p1 - p2) %>%
  mutate(p4 = as.numeric(exp(coef4[1,]+coef4[2,]*health+coef4[3,]*attack+coef1[4,]*Charge)/(1+exp(coef4[1,]+coef4[2,]*health+coef4[3,]*attack+coef1[4,]*Charge))) - p1 - p2 - p3) %>%
  mutate(p5 = as.numeric(exp(coef5[1,]+coef5[2,]*health+coef5[3,]*attack+coef1[4,]*Charge)/(1+exp(coef5[1,]+coef5[2,]*health+coef5[3,]*attack+coef1[4,]*Charge))) - p1 - p2 - p3 - p4) %>%
  mutate(p6 = as.numeric(exp(coef6[1,]+coef6[2,]*health+coef6[3,]*attack+coef1[4,]*Charge)/(1+exp(coef6[1,]+coef6[2,]*health+coef6[3,]*attack+coef1[4,]*Charge))) - p1 - p2 - p3 - p4 - p5) %>% mutate(p7 = 1 - p1 - p2 - p3 - p4 - p5 - p6) %>% 
  mutate(value = 7) %>% 
  group_by(cardId) %>%
  summarize(value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p1, 1, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p2, 2, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p3, 3, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p4, 4, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p5, 5, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p6, 6, value))

test3 = test %>%
  select(cost, cost1, attack, health, cardId, playerClass, mechanics) %>% 
  left_join(test3, by = "cardId")
model3 = RMSE(test3$cost1, test3$value)
rmse_results = bind_rows(rmse_results,data_frame(method="X: attack, health, charge",  
                                     RMSE = model3))
rmse_results
## Source: local data frame [4 x 2]
## 
##                      method      RMSE
##                       (chr)     (dbl)
## 1                 X: attack 1.0973065
## 2            X: attplusheal 0.8451543
## 3         X: attack, health 0.8451543
## 4 X: attack, health, charge 0.8806306

## X: attack, health, charge, and overload
## Y: cost

fitCL = vglm(cbind(Y1, Y2, Y3, Y4, Y5, Y6, Y7) ~ health + attack + Charge + Overload, cumulative(parallel = T, reverse = F), data = train)
# summary(fitCL)


for(i in 1: 6){
  assign(paste("coef",i, sep = ""), as.data.frame((coef(fitCL, matrix = T)[,i])))
}


# To estimate the cost of cards based on attack, health, charge and overload:
test3 = test %>% mutate(p1 = as.numeric(exp(coef1[1,]+coef1[2,]*health+coef1[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef1[1,]+coef1[2,]*health+coef1[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)))) %>%
  mutate(p2 = as.numeric(exp(coef2[1,]+coef2[2,]*health+coef2[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef2[1,]+coef2[2,]*health+coef2[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1) %>%
  mutate(p3 = as.numeric(exp(coef3[1,]+coef3[2,]*health+coef3[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef3[1,]+coef3[2,]*health+coef3[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1 - p2) %>%
  mutate(p4 = as.numeric(exp(coef4[1,]+coef4[2,]*health+coef4[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef4[1,]+coef4[2,]*health+coef4[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1 - p2 - p3) %>%
  mutate(p5 = as.numeric(exp(coef5[1,]+coef5[2,]*health+coef5[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef5[1,]+coef5[2,]*health+coef5[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1 - p2 - p3 - p4) %>%
  mutate(p6 = as.numeric(exp(coef6[1,]+coef6[2,]*health+coef6[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef6[1,]+coef6[2,]*health+coef6[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1 - p2 - p3 - p4 - p5) %>% mutate(p7 = 1 - p1 - p2 - p3 - p4 - p5 - p6) %>% 
  mutate(value = 7) %>% 
  group_by(cardId) %>%
  summarize(value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p1, 1, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p2, 2, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p3, 3, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p4, 4, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p5, 5, value),
              value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p6, 6, value))

test3 = test %>%
  select(cost, cost1, attack, health, cardId, playerClass, mechanics) %>% 
  left_join(test3, by = "cardId")
model3 = RMSE(test3$cost1, test3$value)
rmse_results = bind_rows(rmse_results,data_frame(method="X: attack, health, charge, overload",  
                                     RMSE = model3))
rmse_results
## Source: local data frame [5 x 2]
## 
##                                method      RMSE
##                                 (chr)     (dbl)
## 1                           X: attack 1.0973065
## 2                      X: attplusheal 0.8451543
## 3                   X: attack, health 0.8451543
## 4           X: attack, health, charge 0.8806306
## 5 X: attack, health, charge, overload 0.8921426

We ended up using the model with the following covariates: health, attack, charge, and overload.

final = minions_text1 %>% 
  select(cardId, name, cost,cost1,mechanics,Charge, Overload, attack, health) %>%
  rbind(highcost_card) %>%
  mutate(p1 = as.numeric(exp(coef1[1,]+coef1[2,]*health+coef1[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef1[1,]+coef1[2,]*health+coef1[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)))) %>%
  mutate(p2 = as.numeric(exp(coef2[1,]+coef2[2,]*health+coef2[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef2[1,]+coef2[2,]*health+coef2[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1) %>%
  mutate(p3 = as.numeric(exp(coef3[1,]+coef3[2,]*health+coef3[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef3[1,]+coef3[2,]*health+coef3[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1 - p2) %>%
  mutate(p4 = as.numeric(exp(coef4[1,]+coef4[2,]*health+coef4[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef4[1,]+coef4[2,]*health+coef4[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1 - p2 - p3) %>%
  mutate(p5 = as.numeric(exp(coef5[1,]+coef5[2,]*health+coef5[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef5[1,]+coef5[2,]*health+coef5[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1 - p2 - p3 - p4) %>%
  mutate(p6 = as.numeric(exp(coef6[1,]+coef6[2,]*health+coef6[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload)/(1+exp(coef6[1,]+coef6[2,]*health+coef6[3,]*attack+coef1[4,]*Charge+coef1[5,]*Overload))) - p1 - p2 - p3 - p4 - p5) %>% 
  mutate(p7 = 1 - p1 - p2 - p3 - p4 - p5 - p6) %>% 
  mutate(value = 7) %>% 
  group_by(cardId, cost1, attack, health, name, mechanics) %>%
  summarize(value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p1, 1, value),
            value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p2, 2, value),
            value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p3, 3, value),
            value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p4, 4, value),
            value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p5, 5, value),
            value = ifelse(max(p1,p2,p3,p4,p5,p6,p7) == p6, 6, value))

final %>% filter(value != cost1) %>%
  mutate(resid = value - cost1) %>%
  ggplot(aes(resid, group = mechanics, fill = mechanics)) + stat_bin(aes(y = ..count..), bins = 10 , position='dodge') 

From the above plot, we can see that after adjusting for the chosen features (health, attack, charge, overload), some cards were overvalued (resid = estimated cost - assigned cost < 0) and some were undervalued (resid > 0). Battlecry cards (which trigger a described effect when played), Normal cards (no special effect), and Deathrattle cards (which trigger a described effect when destroyed) are the most frequently over- or under-valued.

Simulation

2. What is the balance of choice between low-cost and high-cost cards?

According to our previous model and our experience with Hearthstone, the higher the cost of a card, the more powerful it is. However, high-cost cards are less likely to be playable in the first few rounds (due to limited mana in hand). We therefore looked for a balance between low- and high-cost cards. To simulate this, we made the following assumptions:

Assumption 1: Players will not play zero-cost cards in the first few rounds.
Assumption 2: Cost roughly represents the value of a card, so we can maximize the total cost of all 30 cards to maximize their value.
Assumption 3: We focus on the first 5 turns, since most cards (76.3%) cost less than 6.

First, create decks with all reasonable combinations of small cards (cost 1-5) and others.

decks <- expand.grid(n1=0:6, n2=0:6, n3=0:6, n4=0:6, n5=0:6)
decks <- decks %>% tbl_df %>% mutate(others = 30-n1-n2-n3-n4-n5)

Next, we used simulation to estimate, for each deck, the probability of being able to play a card within the first 1/2/3/4/5 turns. Estimates are made for the offensive (first) player, since the defensive player has a higher probability of playing cards in the first few turns (4 starting cards plus a special zero-cost card, The Coin, that temporarily increases mana by 1).

prob_usecard <- function(deck){
        card <- rep(c(1,2,3,4,5,10), deck)
                
        # offensive player
        temp <- t(replicate(1000,sample(card,30)))
        # mulligan: keep the 3 smallest of the first 6 cards seen,
        # then shuffle the remaining 27 cards back into the deck
        sortcard <- t(apply(temp[,1:6],1,sort))
        temp[,1:6] <- sortcard
        sortcard2 <- t(apply(temp[,4:30],1,function(x){sample(x,27)}))
        temp[,4:30] <- sortcard2
        rm(sortcard)
        rm(sortcard2)
        
        # p1: can use card in the first turn
        p1 <- mean(apply(temp[,1:4],1,function(c){as.numeric(sum(c<2)>0)}))  
        # p2: can use card in the first 2 turns
        p2 <- mean(apply(temp[,1:5],1,function(c){as.numeric(sum(c<3)>0)}))  
        # p3: can use card in the first 3 turns
        p3 <- mean(apply(temp[,1:6],1,function(c){as.numeric(sum(c<4)>0)}))  
        # p4: can use card in the first 4 turns
        p4 <- mean(apply(temp[,1:7],1,function(c){as.numeric(sum(c<5)>0)}))  
        # p5: can use card in the first 5 turns
        p5 <- mean(apply(temp[,1:8],1,function(c){as.numeric(sum(c<6)>0)}))  
        
        c(p1, p2, p3, p4, p5)     
}
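As a sanity check on the simulated probabilities, the no-mulligan case has a closed form: the chance of seeing at least one playable card among the first n draws is a hypergeometric complement. A sketch with hypothetical counts (`p_at_least_one` is an illustrative helper, not part of the analysis; the R simulation will give somewhat higher values because its mulligan step keeps small cards):

```python
from math import comb

def p_at_least_one(deck_size, playable, seen):
    """P(at least one of `playable` cards among `seen` cards drawn
    without replacement from a deck of `deck_size` cards)."""
    return 1 - comb(deck_size - playable, seen) / comb(deck_size, seen)

# e.g. 6 cards of cost <= 1 in a 30-card deck, 4 cards seen by turn 1
p1_exact = p_at_least_one(deck_size=30, playable=6, seen=4)
```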

# get the probability of using card and combine
usecard <- t(apply(decks,1,prob_usecard))
colnames(usecard) <- c("p1","p2","p3","p4","p5")
decks <- cbind(decks,usecard) %>% 
        # add the total cost for each deck
        mutate(sum = n1+2*n2+3*n3+4*n4+5*n5+10*others) 
rm(usecard)

# save simulation results 
write.csv(decks,file="/Users/Yinnan/Desktop/2016/HearthScience/simulation.csv")
# get the simulation result from github
url <- "https://raw.githubusercontent.com/jihua0125/HearthScience/master/simulation.csv"
decks <- read_csv(url)

decks <- decks[,-1]

# constrain on probability of using card
# fast tempo
decks.fast <- decks %>% tbl_df %>% filter(p4>0.99, p2>0.9, p3>0.95, others>10) %>%
        arrange(desc(sum)) 

decks.fast %>% summarize(min2 = min(n1+n2), min3 = min(n1+n2+n3), min4 = min(n1+n2+n3+n4))
## Source: local data frame [1 x 3]
## 
##    min2  min3  min4
##   (int) (int) (int)
## 1     7     7     9
# mid tempo
decks.mid <- decks %>% tbl_df %>% filter(p4>0.95, p2>0.8, p3>0.9, others>10) %>%
        arrange(desc(sum)) 

decks.mid %>% summarize(min2 = min(n1+n2), min3 = min(n1+n2+n3), min4 = min(n1+n2+n3+n4))
## Source: local data frame [1 x 3]
## 
##    min2  min3  min4
##   (int) (int) (int)
## 1     5     6     7
# slow tempo
decks.slow <- decks %>% tbl_df %>% filter(p4>0.95, p2>0.5, p3>0.8, others>10) %>%
        arrange(desc(sum)) 

decks.slow %>% summarize(min2 = min(n1+n2), min3 = min(n1+n2+n3), min4 = min(n1+n2+n3+n4))
## Source: local data frame [1 x 3]
## 
##    min2  min3  min4
##   (int) (int) (int)
## 1     2     5     7

The simulation results show the minimum numbers of low-cost cards required under each play mode (fast, mid, and slow tempo).

In fast-tempo play mode, we need at least 7 cards with cost no more than 2, at least 7 cards with cost no more than 3, and at least 9 cards with cost no more than 4.

In mid-tempo play mode, we need at least 5 cards with cost no more than 2, at least 6 cards with cost no more than 3, and at least 7 cards with cost no more than 4.

In slow-tempo play mode, we need at least 2 cards with cost no more than 2, at least 5 cards with cost no more than 3, and at least 7 cards with cost no more than 4.

Deck Analysis

3. Are there any “core” combinations of cards?

Look into the deck-specific features

From our empirical knowledge, each deck has its own strategy to win the game, depending on the hero and the play tempo (fast/slow). Common strategies include aggro, control, midrange, and face. These strategies are closely related to the average cost of the minions in the deck.
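To fix the definition we have in mind: the average minion cost of a deck, weighting each card by how many copies it runs. A tiny sketch with a hypothetical deck (note that the dplyr code below instead averages cost*counts over card rows, a slightly different quantity):

```python
def ave_cost(cards):
    """Average minion cost of a deck, weighted by copy counts.
    `cards` is a list of (cost, copies) pairs."""
    total = sum(cost * copies for cost, copies in cards)
    n = sum(copies for _, copies in cards)
    return total / n

# hypothetical minion lineup: (cost, copies)
deck = [(1, 2), (2, 4), (3, 4), (4, 2), (6, 2)]
avg = ave_cost(deck)   # (2 + 8 + 12 + 8 + 12) / 14 = 3.0
```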

minions<-read.csv("minions.csv",sep="\t")
weapons<-read.csv("weapons.csv",sep="\t")
spells<-read.csv("spells.csv",sep="\t")
cards<-rbind(minions,weapons,spells)
# load("D:/HSPH/BIO 260/final/data/minions_text.RData")

classes<-c("druid","hunter","mage","paladin","priest","rogue","shaman","warlock","warrior")
decks<-list()
heroDeckLists<-list()
for(i in 1:9){
  filename<-paste(classes[i],"decks.csv",sep="")
  heroDeckLists[[i]]<-read.csv(filename,sep="\t")
  decks[[i]]<-heroDeckLists[[i]]%>%gather(deckId,cardCounts,2:(length(heroDeckLists[[i]])-1))
}

### per-hero deck cost distributions (one histogram per class, in `classes` order)
heroDeckCost<-list()
for(i in 1:9){
  heroDeckCost[[i]]<-decks[[i]]%>%filter(cardCounts!=0)%>%
    left_join(cards,by=c("cardNames"="name"))%>%
    filter(type=="Minion")%>%
    group_by(deckId)%>%
    mutate(cardTotalCost=cost*cardCounts)%>%
    mutate(aveCost=mean(cardTotalCost))%>%
    ungroup()
  print(heroDeckCost[[i]]%>%select(deckId,aveCost)%>%distinct()%>%
    ggplot(aes(x=aveCost))+geom_histogram(binwidth = 0.5)+
    ggtitle(paste(classes[i],"deck distribution")))
}

From the histograms we can see that warlock is quite different from the other heroes: the distribution of its average deck costs is bimodal, while the other classes look roughly normal. This finding suggested exploring the warlock data further.

Explore Card Interactions within Warlock Decks

We looked at the correlations between cards within warlock decks, calculating the pairwise correlation between every two cards.

data<-read.csv("correlation.csv")
colnames(data)<-gsub("\\."," ",colnames(data))
#warlockDecks<-heroDeckLists[[8]]
#rownames(warlockDecks)<-t(warlockDecks[,1])
#data<-warlockDecks%>%select(-cardNames)

#calculate correlation matrix
corMatrix<-cor(x=data)
hClust<-hclust(dist(t(data)),method="complete")
plot(hClust,cex=0.6)

pc<-prcomp(corMatrix)
summary(pc)
## Importance of components:
##                           PC1     PC2     PC3    PC4     PC5    PC6
## Standard deviation     1.9738 0.81496 0.69388 0.6568 0.53860 0.4793
## Proportion of Variance 0.5581 0.09514 0.06897 0.0618 0.04156 0.0329
## Cumulative Proportion  0.5581 0.65324 0.72221 0.7840 0.82557 0.8585
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.39575 0.35725 0.33647 0.29937 0.28180 0.24043
## Proportion of Variance 0.02244 0.01828 0.01622 0.01284 0.01138 0.00828
## Cumulative Proportion  0.88091 0.89919 0.91541 0.92824 0.93962 0.94790
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.23998 0.18815 0.18349 0.16857 0.16221 0.14772
## Proportion of Variance 0.00825 0.00507 0.00482 0.00407 0.00377 0.00313
## Cumulative Proportion  0.95615 0.96122 0.96605 0.97012 0.97389 0.97701
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.14260 0.12982 0.12585 0.11783 0.11170 0.10653
## Proportion of Variance 0.00291 0.00241 0.00227 0.00199 0.00179 0.00163
## Cumulative Proportion  0.97993 0.98234 0.98461 0.98660 0.98838 0.99001
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.10423 0.09051 0.08557 0.08310 0.07612 0.06852
## Proportion of Variance 0.00156 0.00117 0.00105 0.00099 0.00083 0.00067
## Cumulative Proportion  0.99157 0.99274 0.99379 0.99478 0.99561 0.99628
##                           PC31    PC32    PC33    PC34    PC35    PC36
## Standard deviation     0.06737 0.06297 0.05612 0.05256 0.04522 0.03921
## Proportion of Variance 0.00065 0.00057 0.00045 0.00040 0.00029 0.00022
## Cumulative Proportion  0.99693 0.99750 0.99795 0.99835 0.99864 0.99886
##                           PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.03380 0.03334 0.03195 0.03020 0.02749 0.02258
## Proportion of Variance 0.00016 0.00016 0.00015 0.00013 0.00011 0.00007
## Cumulative Proportion  0.99902 0.99918 0.99933 0.99946 0.99957 0.99964
##                           PC43    PC44    PC45    PC46    PC47    PC48
## Standard deviation     0.02142 0.01927 0.01714 0.01688 0.01367 0.01317
## Proportion of Variance 0.00007 0.00005 0.00004 0.00004 0.00003 0.00002
## Cumulative Proportion  0.99971 0.99976 0.99980 0.99984 0.99987 0.99989
##                           PC49    PC50    PC51     PC52     PC53     PC54
## Standard deviation     0.01275 0.01135 0.01054 0.009396 0.008152 0.007593
## Proportion of Variance 0.00002 0.00002 0.00002 0.000010 0.000010 0.000010
## Cumulative Proportion  0.99992 0.99994 0.99995 0.999960 0.999970 0.999980
##                            PC55    PC56     PC57     PC58     PC59
## Standard deviation     0.005803 0.00479 0.004449 0.003708 0.003085
## Proportion of Variance 0.000000 0.00000 0.000000 0.000000 0.000000
## Cumulative Proportion  0.999990 0.99999 0.999990 0.999990 1.000000
##                            PC60     PC61     PC62     PC63     PC64
## Standard deviation     0.002767 0.002548 0.002223 0.001594 0.001348
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion  1.000000 1.000000 1.000000 1.000000 1.000000
##                            PC65      PC66      PC67      PC68      PC69
## Standard deviation     0.001251 0.0009424 0.0009018 0.0006618 0.0005387
## Proportion of Variance 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  1.000000 1.0000000 1.0000000 1.0000000 1.0000000
##                             PC70     PC71      PC72      PC73      PC74
## Standard deviation     0.0004562 0.000395 0.0002827 0.0002188 0.0001501
## Proportion of Variance 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  1.0000000 1.000000 1.0000000 1.0000000 1.0000000
##                             PC75      PC76      PC77      PC78      PC79
## Standard deviation     9.085e-05 4.207e-05 6.774e-16 1.884e-16 1.884e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC80      PC81      PC82      PC83      PC84
## Standard deviation     1.884e-16 1.884e-16 1.884e-16 1.884e-16 1.884e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC85      PC86      PC87      PC88      PC89
## Standard deviation     1.884e-16 1.884e-16 1.884e-16 1.884e-16 1.884e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC90      PC91      PC92      PC93      PC94
## Standard deviation     1.884e-16 1.884e-16 1.884e-16 1.884e-16 1.884e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC95      PC96      PC97      PC98      PC99
## Standard deviation     1.884e-16 1.884e-16 1.884e-16 1.884e-16 1.884e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC100     PC101     PC102
## Standard deviation     1.884e-16 1.884e-16 8.069e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00
data.t<-t(data)
d1<-dist(data) 
d2<-dist(data.t)
cormat<-round(cor(data.t),2)
mtscaled<-as.matrix(d1)


### triangle heatmap

source("https://raw.githubusercontent.com/briatte/ggcorr/master/ggcorr.R")
##ggcorr(cormat)
ggcorr(cormat,hjust = 0.3, size = 1, color = "grey50")

From the principal components analysis, we can see that the top 2 principal components explain about two thirds of the variance between cards, and the top 9 explain about 91%. We therefore use the first 9 principal components in the following analysis, to keep the scale of the problem tractable and avoid overfitting.
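The cumulative proportions reported by summary(pc) are just normalized eigenvalues of the covariance matrix. A self-contained check of that computation on toy data (numpy stands in for prcomp; the card matrix itself is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 50 observations, 5 variables, one dominant direction
X = rng.normal(size=(50, 5))
X[:, 0] += 3 * X[:, 1]                      # induce strong correlation

Xc = X - X.mean(axis=0)                     # center, as prcomp does
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]     # PC variances, descending
prop = eigvals / eigvals.sum()              # proportion of variance
cumprop = np.cumsum(prop)                   # cumulative proportion
```

Keeping the first m components such that the cumulative proportion passes a threshold (91% at m = 9 in the analysis above) is the same rule applied here.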

pcaData <-pc$x[,1:9]
pca1 <-pc$x[,1]
pca2 <-pc$x[,2]
pca3<- pc$x[,3]
pca4 <-pc$x[,4]
pca5 <-pc$x[,5]
pca6<- pc$x[,6]
pca7 <-pc$x[,7]
pca8 <-pc$x[,8]
pca9<- pc$x[,9]

wss <- (nrow(pcaData)-1)*sum(apply(pcaData,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(pcaData,centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

kmeans.cluster<-kmeans(pcaData, centers=4)
pc.df<-data.frame(ID=names(pca1),PCA1=pca1, PCA2=pca2, PCA3=pca3,PCA4=pca4,PCA5=pca5,PCA6=pca6,PCA7=pca7,PCA8=pca8,PCA9=pca9, Cluster=factor(kmeans.cluster$cluster))
pc.df%>%ggplot(aes(x=PCA1, y=PCA2, label=ID, color=Cluster))+geom_jitter()+
  geom_text_repel(aes(PCA1, PCA2, label=ID),data = filter(pc.df,PCA1 < -2.5 | PCA1 >2.5| PCA2 < -1.5 | PCA2>1.5))

total.df<-pc.df%>%left_join(cards,by=c("ID"="name"))
total.df%>%ggplot(aes(x=PCA1, y=PCA2, label=cost, color=Cluster))+geom_jitter()+geom_text_repel()

pc.df%>%group_by(Cluster)%>%summarize(n())
## Source: local data frame [4 x 2]
## 
##   Cluster   n()
##    (fctr) (int)
## 1       1    19
## 2       2    56
## 3       3    17
## 4       4    10

Clustering of cards

Above, we used k-means clustering to distinguish different types of cards. From the within-group sum of squares (elbow) plot, 4 clusters is a reasonable balance point, so we ran k-means with 4 centroids. We randomly picked one deck to check whether the clusters make sense; as the following table shows, most of its cards fall into the same cluster.
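The elbow rule behind that plot can be sketched end-to-end with a plain Lloyd k-means on toy 2-D blobs (kmeans_wss is a hypothetical helper, not part of the analysis code):

```python
import numpy as np

def kmeans_wss(X, k, iters=50, seed=0):
    """Total within-cluster sum of squares after plain Lloyd k-means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data
    for _ in range(iters):
        # squared distances of every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(1)
# three well-separated blobs, so WSS should drop sharply up to k = 3
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 0], [0, 5])])
wss = [kmeans_wss(X, k) for k in range(1, 6)]
```

Plotting wss against k and picking the bend is exactly what the R elbow plot above does on the PCA scores.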

deck<-heroDeckLists[[8]]%>%select(cardNames,X60)%>%
  filter(X60!=0)%>%
  left_join(pc.df,by=c("cardNames"="ID"))
deck[,c(1,12)]%>%kable
cardNames            Cluster
Abusive Sergeant     3
Dark Peddler         3
Defender of Argus    3
Flame Imp            3
Imp Gang Boss        3
Knife Juggler        3
Voidwalker           3
Hellfire             1
Loatheb              2
Haunted Creeper      3
Nerubian Egg         3
Power Overwhelming   3
Doomguard            3
Soulfire             4
Fist of Jaraxxus     2
Leper Gnome          2

Also, we made the following correlation heatmap to see the correlations inside the clusters.
### separate data set
fullcluster<-pc.df%>%select(-PCA1,-PCA2,-PCA3,-PCA4,-PCA5,-PCA6,-PCA7,-PCA8,-PCA9)
cluster1<-fullcluster%>%filter(Cluster=="1")%>%select(-Cluster)
cluster2<-fullcluster%>%filter(Cluster=="2")%>%select(-Cluster)
cluster3<-fullcluster%>%filter(Cluster=="3")%>%select(-Cluster)
cluster4<-fullcluster%>%filter(Cluster=="4")%>%select(-Cluster)
#convert the rownames to first column "ID"
ID<-rownames(fullcluster)
#rownames(data)<-NULL
dat<-as.data.frame(cbind(ID,t(data)))
#create 4 dataset by "ID"
dataset1<-dplyr::right_join(dat,cluster1,by="ID")
dataset2<-dplyr::right_join(dat,cluster2,by="ID")
dataset3<-dplyr::right_join(dat,cluster3,by="ID")
dataset4<-dplyr::right_join(dat,cluster4,by="ID")

#convert the first column to rownames
rownames(dataset1)<-dataset1$ID
rownames(dataset2)<-dataset2$ID
rownames(dataset3)<-dataset3$ID
rownames(dataset4)<-dataset4$ID
dataset1<-dataset1[,-1]
dataset2<-dataset2[,-1]
dataset3<-dataset3[,-1]
dataset4<-dataset4[,-1]
data1.t<-t(data.matrix(dataset1))
data2.t<-t(data.matrix(dataset2))
data3.t<-t(data.matrix(dataset3))
data4.t<-t(data.matrix(dataset4))
#correlation within the first dataset
cormat1<-round(cor(data1.t),2)
cormat2<-round(cor(data2.t),2)
cormat3<-round(cor(data3.t),2)
cormat4<-round(cor(data4.t),2)

#correlation matrix 
melted_cormat1 <- melt(cormat1)
p1<-ggplot(data = melted_cormat1, aes(X2, X1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal()

p1+ theme(axis.text.y = element_text(vjust = 1, 
                                     size = 4, hjust = 1))+theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                                                                             size = 3, hjust = 1))+scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 0))+theme(legend.background=element_rect(),legend.margin=unit(1,"cm"))

melted_cormat2 <- melt(cormat2)
p2<-ggplot(data = melted_cormat2, aes(X2, X1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal()

p2 + theme(axis.text.y = element_text(vjust = 1, size = 4, hjust = 1),
           axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1),
           legend.background = element_rect(),
           legend.margin = unit(1, "cm")) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0))

melted_cormat3 <- melt(cormat3)
p3<-ggplot(data = melted_cormat3, aes(X2, X1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal()

p3 + theme(axis.text.y = element_text(vjust = 1, size = 10, hjust = 1),
           axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1),
           legend.background = element_rect(),
           legend.margin = unit(1, "cm")) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0))

melted_cormat4 <- melt(cormat4)
p4<-ggplot(data = melted_cormat4, aes(X2, X1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal()

p4 + theme(axis.text.y = element_text(vjust = 1, size = 10, hjust = 1),
           axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1),
           legend.background = element_rect(),
           legend.margin = unit(1, "cm")) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0))

Let’s look at the card frequency distribution.

freqTable<-heroDeckLists[[8]]%>%tbl_df()%>%
  mutate(cardTotalCounts=rowSums(heroDeckLists[[8]][,2:length(heroDeckLists[[8]])]))%>%
  dplyr::select(cardNames,cardTotalCounts)
total.df<-total.df%>%left_join(freqTable,by=c("ID"="cardNames"))
total.df%>%dplyr::select(ID,cardTotalCounts,Cluster)%>%filter(complete.cases(.))%>%
  ggplot(aes(Cluster,cardTotalCounts))+geom_point()

From the plots above, we can see that the cards in clusters 1 and 3 appear more frequently in decks. This helps us select the core cards of a deck. A core card should neither appear too often, which would make it look like a panacea, nor too rarely, which would suggest it has few interactions with other cards.

coreTable<-total.df%>%filter(type=="Minion")%>%dplyr::select(ID,cardTotalCounts,Cluster,cost)%>%filter(complete.cases(.))%>%
  filter(cardTotalCounts<90&cardTotalCounts>60)

coreTable%>%group_by(Cluster)%>%summarize(n())
## Source: local data frame [2 x 2]
## 
##   Cluster   n()
##    (fctr) (int)
## 1       1     6
## 2       3     7

Now each cluster contains a number of core-card candidates. But 6 and 7 core cards are a bit too many. So let’s simulate how the number of core cards affects the probability of collecting all of them after drawing a certain number of cards.

Number of Cards Drawn

For each deck, there are several “core” cards that have the greatest effect when used together. We usually put 2 copies of each core-card component in the deck, and we want to draw at least one copy of every component as early as possible.

First we list all possible decks with core cards and normal cards. Each set of core cards includes 2–7 different components. We consider the offensive side/early hand first.

# sort the first 6 cards for the offensive side/early hand; assume we always keep the core cards
sort.offensive <- function(tmp){
        sortcard <- t(apply(tmp[,1:6],1,function(x){sort(x,decreasing = T)}))
        tmp[,1:6] <- sortcard
        sortcard2 <- t(apply(tmp[,4:30],1,function(x){sample(x,27)}))
        tmp[,4:30] <- sortcard2
        tmp
}
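To make the mulligan model concrete: the function sorts the first six positions so that core cards (non-zero labels), if drawn early, land in the opening three slots, then reshuffles everything from position four on. A quick look at one simulated deck (a sketch; the exact values depend on the random seed):

```r
set.seed(1)
# one deck with a 2-component core set (labels 1 and 2, two copies each)
deck <- t(replicate(1, sample(c(1, 1, 2, 2, rep(0, 26)), 30)))
sort.offensive(deck)[, 1:6]  # any core labels among the first 6 now lead the hand
```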

# 2 components core cards set, each with 2 cards
card <- c(1,1,2,2,rep(0,26))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.offensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i

offen_core2 <- function(i){
        mean(apply(tmp[,1:(i+3)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0)})) 
}

o2 <- sapply(1:27,offen_core2)

# 3 components core cards set, each with 2 cards
card <- c(1,1,2,2,3,3,rep(0,24))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.offensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
offen_core3 <- function(i){
        mean(apply(tmp[,1:(i+3)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0)})) 
}

o3 <- sapply(1:27,offen_core3)


# 4 components core cards set, each with 2 cards
card <- c(1,1,2,2,3,3,4,4,rep(0,22))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.offensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
offen_core4 <- function(i){
        mean(apply(tmp[,1:(i+3)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0 & sum(c==4)>0)})) 
}

o4 <- sapply(1:27,offen_core4) 


# 5 components core card set
card <- c(1,1,2,2,3,3,4,4,5,5,rep(0,20))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.offensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
offen_core5 <- function(i){
        mean(apply(tmp[,1:(i+3)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0 & sum(c==4)>0 & sum(c==5)>0)})) 
}

o5 <- sapply(1:27,offen_core5) 

# 6 components core card set
card <- c(1,1,2,2,3,3,4,4,5,5,6,6,rep(0,18))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.offensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
offen_core6 <- function(i){
        mean(apply(tmp[,1:(i+3)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0 & sum(c==4)>0 & sum(c==5)>0 & sum(c==6)>0)})) 
}

o6 <- sapply(1:27,offen_core6) 

# 7 components core card set
card <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,rep(0,16))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.offensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
offen_core7 <- function(i){
        mean(apply(tmp[,1:(i+3)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0 & sum(c==4)>0 & sum(c==5)>0 & sum(c==6)>0 & sum(c==7)>0)})) 
}

o7 <- sapply(1:27,offen_core7) 
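The six chunks above differ only in the number of core components. The whole offensive simulation can be generated by one function (a sketch, equivalent under the same assumptions and reusing `sort.offensive` from above):

```r
# probability curves for k core components (2 copies each), offensive side
sim_offensive <- function(k, n_sim = 10000) {
  card <- c(rep(1:k, each = 2), rep(0, 30 - 2 * k))
  tmp  <- sort.offensive(t(replicate(n_sim, sample(card, 30))))
  sapply(1:27, function(i) {
    # fraction of simulated games holding at least one copy of every component
    mean(apply(tmp[, 1:(i + 3)], 1,
               function(c) all(sapply(1:k, function(j) any(c == j)))))
  })
}
# o2..o7 correspond to sim_offensive(2) .. sim_offensive(7)
```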

Similarly we can estimate the probability for the defensive side/late hand.

# sort the first 8 cards for the defensive side/late hand; assume we always keep the core cards
sort.defensive <- function(tmp){
        sortcard <- t(apply(tmp[,1:8],1,function(x){sort(x,decreasing = T)}))
        tmp[,1:8] <- sortcard
        sortcard2 <- t(apply(tmp[,5:30],1,function(x){sample(x,26)}))
        tmp[,5:30] <- sortcard2
        tmp
}

# 2 components core cards set, each with 2 cards
card <- c(1,1,2,2,rep(0,26))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.defensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i

defen_core2 <- function(i){
        mean(apply(tmp[,1:(i+4)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0)})) 
}

d2 <- sapply(1:26,defen_core2)

# 3 components core cards set, each with 2 cards
card <- c(1,1,2,2,3,3,rep(0,24))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.defensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
defen_core3 <- function(i){
        mean(apply(tmp[,1:(i+4)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0)})) 
}

d3 <- sapply(1:26,defen_core3)

# 4 components core cards set, each with 2 cards
card <- c(1,1,2,2,3,3,4,4,rep(0,22))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.defensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
defen_core4 <- function(i){
        mean(apply(tmp[,1:(i+4)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0 & sum(c==4)>0)})) 
}

d4 <- sapply(1:26,defen_core4) 


# 5 components core card set
card <- c(1,1,2,2,3,3,4,4,5,5,rep(0,20))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.defensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
defen_core5 <- function(i){
        mean(apply(tmp[,1:(i+4)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0 & sum(c==4)>0 & sum(c==5)>0)})) 
}

d5 <- sapply(1:26,defen_core5) 

# 6 components core card set
card <- c(1,1,2,2,3,3,4,4,5,5,6,6,rep(0,18))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.defensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
defen_core6 <- function(i){
        mean(apply(tmp[,1:(i+4)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0 & sum(c==4)>0 & sum(c==5)>0 & sum(c==6)>0)})) 
}

d6 <- sapply(1:26,defen_core6)

# 7 components core card set
card <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,rep(0,16))
tmp <- t(replicate(10000,sample(card,30)))

tmp <- sort.defensive(tmp)

# the probability of having the complete set of core cards by n turns
# we want to know the probability by each turn i
defen_core7 <- function(i){
        mean(apply(tmp[,1:(i+4)],1,function(c){as.numeric(sum(c==1)>0 & sum(c==2)>0 & sum(c==3)>0 & sum(c==4)>0 & sum(c==5)>0 & sum(c==6)>0 & sum(c==7)>0)})) 
}

d7 <- sapply(1:26,defen_core7) 
# show results: the probability of getting the whole set of core cards

offensive <- data.frame(o2,o3,o4,o5,o6,o7)
colnames(offensive) <- c(2,3,4,5,6,7)
offensive <- offensive %>% mutate(turn=1:27, card=4:30) %>% gather("n_core","prob",1:6)

defensive <- data.frame(d2,d3,d4,d5,d6,d7)
colnames(defensive) <- c(2,3,4,5,6,7)
defensive <- defensive %>% mutate(turn=1:26, card=5:30) %>% gather("n_core","prob",1:6)

offensive %>% ggplot(aes(card,prob)) + geom_line(aes(color=n_core)) +
        ggtitle("Early hand") +
        scale_x_continuous(breaks=4:30) +
        scale_y_continuous(breaks=seq(0,1,0.1)) +
        geom_vline(xintercept = 13)

defensive %>% ggplot(aes(card,prob)) + geom_line(aes(color=n_core)) +
        ggtitle("Late hand") +
        scale_x_continuous(breaks=5:30) +
        scale_y_continuous(breaks=seq(0,1,0.1)) +
        geom_vline(xintercept = 14)
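As a sanity check on these simulations, the probability of holding all k two-copy components after n random draws has a closed form by inclusion–exclusion over the components that are entirely missing. This ignores the mulligan step, so it is a lower bound relative to the sorted hands above:

```r
# exact P(all k 2-copy components appear among n of 30 drawn cards), no mulligan
# P = sum_{j=0}^{k} (-1)^j * C(k, j) * C(30 - 2j, n) / C(30, n)
p_all_core <- function(k, n) {
  j <- 0:k
  sum((-1)^j * choose(k, j) * choose(30 - 2 * j, n)) / choose(30, n)
}
# e.g. p_all_core(5, 14): chance of holding all 5 pairs after 14 cards
```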

Having selected the core-card candidates, we now try to construct a deck from them. According to the simulations above, we set a vertical line at Turn 10, the point where the probability curves for collecting all k (k = 2, 3, 4, …) core cards separate. Considering Warlock’s hero power (draw one card at the cost of 2 health), by Turn 10 the player will have drawn between 13 and 18 cards in total. We assume the core cards are decisive for winning: if a player collects all of them, he will win. If we set the cutoff at 0.5, meaning the full core set is collected in half of the games, we can see that 5 is the maximum number of core cards that satisfies this criterion. To reduce the number of core cards, we used the card evaluation from step 1: among the candidates, we took the most underestimated cards as our core cards. We then calculated the distance from every other card to the center of these core cards and picked the nearest ones to fill out the deck.

coreTable<-coreTable%>%left_join(final,by=c("ID"="name"))
coreTable<-coreTable%>%select(-cardId)
coreTable<-coreTable%>%filter(complete.cases(.))%>%
  mutate(undervalue=value-cost)

zooCore<-coreTable%>%filter(undervalue>0 & Cluster==3)

zooDeckAvailableNumber<-30
zooDeck<-zooCore%>%select(ID)%>%mutate(count=2)

zooDeckAvailableNumber<-zooDeckAvailableNumber-dim(zooDeck)[1]*2

i<-0
while(as.numeric(zooDeckAvailableNumber)>0){
  i=i+1
  zooPCA<-zooDeck%>%left_join(total.df,by="ID")
  center<-colMeans(zooPCA[,3:11])
  neighbors<-total.df%>%
    mutate(distance=sqrt((PCA1-center[1])^2+(PCA2-center[2])^2+
             (PCA3-center[3])^2+(PCA4-center[4])^2+(PCA5-center[5])^2+
             (PCA6-center[6])^2+(PCA7-center[7])^2+(PCA8-center[8])^2+
             (PCA9-center[9])^2))%>%arrange(distance)%>%
    filter(!ID %in% zooDeck[,1])
  if(zooDeckAvailableNumber==1){
    newCard<-neighbors[i,]%>%mutate(count=1)%>%select(ID,count)
  }else{
    newCard<-neighbors[i,]%>%mutate(count=ifelse(rarity!="Legendary",2,1))%>%select(ID,count)
  }
  zooDeck<-rbind(zooDeck,newCard)
  zooDeckAvailableNumber<-zooDeckAvailableNumber-as.numeric(newCard[,2])
}

Here is the deck we finally constructed.

zooDeck<-read.csv("zooDeck.csv",header = TRUE,sep=",")

zooDeck[,2:3]%>%kable
ID                    count
Knife Juggler             2
Voidwalker                2
Doomguard                 2
Flame Imp                 2
Nerubian Egg              2
Dire Wolf Alpha           2
Power Overwhelming        2
Void Terror               2
Argent Squire             2
Dark Iron Dwarf           2
Sea Giant                 2
Curse of Rafaam           2
Voodoo Doctor             2
Bane of Doom              2
Leeroy Jenkins            1
Harvest Golem             1

In the deck we built, we have 8 cards with cost 1, 6 cards with cost 2, 3 cards with cost 3, and 2 cards with cost 4. It satisfies our simulated fast-tempo play mode, which requires at least 7 cards costing no more than 2 and at least 9 cards costing no more than 4. Thus we should be able to play the cards randomly drawn from the deck with high probability (at least 90% in the first 2 turns, 95% in the first 3 turns, and 99% in the first 4 turns).
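The playability claim can be spot-checked with a hypergeometric calculation. This is a sketch assuming the offensive opening (3 cards plus one draw per turn) and the counts above, i.e. 14 cards in the deck costing at most 2:

```r
# P(at least one card costing <= a threshold among the first m cards seen),
# where n_cheap of the 30 deck cards are below the threshold
p_playable <- function(n_cheap, m, deck = 30) {
  1 - choose(deck - n_cheap, m) / choose(deck, m)
}
p_playable(14, 5)  # by turn 2: 3 opening cards + 2 draws; ~0.97, above the 90% claim
```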

Validation

Card Value Model Validation: Blizzard’s adjustment of card values:

Blizzard recently (four days ago) released an expansion package that includes a series of new cards. However, the information on these cards has not yet been released through Blizzard’s API. Fortunately, Blizzard also adjusted 12 existing cards (7 of them minions), which allowed us to validate our model. The results are shown below:

new_assigned_cards = c("Knife Juggler", "Big Game Hunter", "Force of Nature", 
  "Molten Giant", "Arcane Golem", "Blade Flurry", 
  "Keeper of the Grove", "Ancient of Lore", "Master of Disguise", 
  "Hunter's Mark", "Ironbeak Owl", "Leper Gnome")

final %>% filter(name %in% new_assigned_cards)
## Source: local data frame [7 x 7]
## Groups: cardId, cost1, attack, health, name [7]
## 
##     cardId cost1 attack health                name   mechanics value
##     (fctr) (dbl)  (int)  (int)              (fctr)       (chr) (dbl)
## 1  CS2_203     2      2      1        Ironbeak Owl   Battlecry     2
## 2  EX1_029     1      2      1         Leper Gnome Deathrattle     2
## 3  EX1_089     3      4      2        Arcane Golem      Charge     4
## 4  EX1_166     4      2      4 Keeper of the Grove      Normal     3
## 5  EX1_620     7      8      8        Molten Giant      Normal     7
## 6 NEW1_008     7      5      5     Ancient of Lore      Normal     5
## 7 NEW1_019     2      3      2       Knife Juggler      Normal     3

Our model predicted 3 of these 7 cards correctly: they were underestimated (more powerful than the cost they were assigned), and after Blizzard’s adjustment these cards were weakened. However, we made wrong predictions on three cards: Ironbeak Owl, Keeper of the Grove, and Ancient of Lore. The first two have a special Silence effect, which is not accounted for in our model since only three cards carry this ability. Ancient of Lore is also misvalued by our model because it has the Choose One ability, which is likewise too uncommon for our model to learn from. In addition, our model treats all cards costing more than 7 as cost 7 (consistent with the categories in the Hearthstone game). Molten Giant has a special mechanic that changes its cost according to the hero’s health, so its cost is not fixed, and our model (which can only evaluate minions with a fixed cost) is not suitable for it.

Real-time evaluation of zooDeck:

After we built this deck, we first tried it against the AI. The results are shown below. Generally speaking, our deck performed quite well in player-vs-computer games.

read.csv("practice.csv",sep="\t",header=T)
##    Match    Hero  Hand Result Cost
## 1      1 Warlock early      W    8
## 2      2 Warlock early      W    9
## 3      3    Mage  late      W    7
## 4      4    Mage early      W    7
## 5      5  Hunter  late      W    7
## 6      6  Hunter early      W    7
## 7      7 Warrior early      W    7
## 8      8 Warrior  late      W    8
## 9      9  Shaman  late      W    9
## 10    10  Shaman  late      L    8
## 11    11   Druid  late      W    6
## 12    12   Druid  late      W    6
## 13    13  Priest early      W    8
## 14    14  Priest  late      W    8
## 15    15   Rogue early      W    7
## 16    16   Rogue early      W    8
## 17    17 Paladin  late      W    6
## 18    18 Paladin early      L    8

After playing against the AI, we wanted to see the deck’s performance in real games (player vs. player). We asked three different players to each play 5 games with this deck and record the results, shown below. The average winning rate is 53.3% (8 wins out of 15 games).

read.csv("match.csv",sep="\t",header=T)
##     Match Player1 Player2 Player3
## 1 Match 1       W       W       W
## 2 Match 2       W       L       W
## 3 Match 3       W       W       L
## 4 Match 4       L       L       L
## 5 Match 5       L       L       W
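The 53.3% figure is just the fraction of wins across all 15 player-vs-player games, which can be computed directly (assuming `match.csv` as printed above):

```r
matches <- read.csv("match.csv", sep = "\t", header = TRUE)
# as.matrix gives a character matrix, so the comparison works per cell
mean(as.matrix(matches[, 2:4]) == "W")  # 8 wins / 15 games ~ 0.533
```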

Special Thanks:

Thanks to Ji’s friends Chengqi and Chenchao for play-testing the deck for us!